Search for: All records

Creators/Authors contains: "Wang, Chengyue"

Note: When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher. Some full-text articles may not be available free of charge during the embargo (administrative interval).

Some links on this page may take you to non-federal websites. Their policies may differ from those of this site.

  1. As AI systems grow increasingly specialized and complex, managing hardware heterogeneity becomes a pressing challenge. How can we efficiently coordinate and synchronize heterogeneous hardware resources to achieve high utilization? How can we minimize the friction of transitioning between diverse computation phases, reducing costly stalls from initialization, pipeline setup, or drain? Our insight is that a network abstraction at the ISA level naturally unifies heterogeneous resource orchestration and phase transitions. This paper presents a Reconfigurable Stream Network Architecture (RSN), a novel ISA abstraction designed for the DNN domain. RSN models the datapath as a circuit-switched network with stateful functional units as nodes and data streaming on the edges. Programming a computation corresponds to triggering a path. Software is explicitly exposed to the compute and communication latency of each functional unit, enabling precise control over data movement for optimizations such as compute-communication overlap and layer fusion. As nodes in a network naturally differ, the RSN abstraction can efficiently virtualize heterogeneous hardware resources by separating control from the data plane, enabling low instruction-level intervention. We build a proof-of-concept design, RSN-XNN, on the VCK190, a heterogeneous platform with FPGA fabric and AI engines. Compared to the SOTA solution on this platform, it reduces latency by 6.1x and improves throughput by 2.4x–3.2x. Compared to the T4 GPU with the same FP32 performance, it matches latency with only 18% of the memory bandwidth. Compared to the A100 GPU at the same 7nm process node, it achieves 2.1x higher energy efficiency in FP32. (A conceptual sketch of the path-triggering model appears after this list.)
    Free, publicly-accessible full text available June 20, 2026
  2. The Fast Fourier Transform (FFT) is a foundational algorithm widely used in fields like digital signal processing and machine learning. While High-Level Synthesis (HLS) tools have boosted the development of customized hardware accelerators in these fields, existing FFT HLS IP libraries often suffer from low throughput and poor usability due to inadequate exploitation of potential parallelism. In contrast, many high-throughput RTL FFT designs lack portability and flexibility, limiting their practical adoption. To resolve this predicament, we conducted an in-depth analysis of the FFT algorithm’s loop structure, uncovering hierarchical parallelism to optimize performance. Based on these insights, we developed a general FFT HLS generator, HP-FFT, which supports multiple functionalities and a wide range of customizable parallelism settings to meet diverse user requirements. Experimental results demonstrate that our proposed HLS generator matches or outperforms state-of-the-art RTL/HLS IP libraries and generators, while enabling users to easily generate architectures that balance resource efficiency and high throughput to suit various application needs. (The FFT loop hierarchy referred to here is sketched after this list.)
    Free, publicly-accessible full text available May 4, 2026
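
The first sketch below, referenced from entry 1, is a minimal conceptual model of the RSN programming abstraction described in that abstract: functional units are nodes with software-visible compute latencies, streams are edges with communication latencies, and issuing a computation amounts to triggering a path through the network. It is an illustration only; the Node, Edge, StreamNetwork, and trigger_path names are assumptions for this sketch, not the paper's actual ISA or API.

// Conceptual sketch of the RSN idea from entry 1: nodes are stateful
// functional units, edges are circuit-switched streams, and programming
// a computation means triggering a path. All names here are illustrative
// assumptions, not the paper's interface.
#include <cstdio>
#include <string>
#include <vector>

// A stateful functional unit with a software-visible compute latency (cycles).
struct Node {
    std::string name;
    int compute_latency;
};

// A circuit-switched edge with a software-visible streaming latency.
struct Edge {
    int src, dst;        // indices into the node table
    int stream_latency;
};

struct StreamNetwork {
    std::vector<Node> nodes;
    std::vector<Edge> edges;

    // Triggering a path "programs" one computation; the returned end-to-end
    // latency is what software could use to overlap paths or fuse layers.
    int trigger_path(const std::vector<int>& path) const {
        int total = 0;
        for (int e : path) {
            total += nodes[edges[e].src].compute_latency;  // produce at source node
            total += edges[e].stream_latency;              // stream to the next node
        }
        if (!path.empty())
            total += nodes[edges[path.back()].dst].compute_latency;  // consume at sink
        return total;
    }
};

int main() {
    // Hypothetical three-node path: DMA -> matmul array -> accumulator.
    StreamNetwork net;
    net.nodes = {{"dma", 4}, {"matmul", 32}, {"accum", 8}};
    net.edges = {{0, 1, 2}, {1, 2, 1}};

    int latency = net.trigger_path({0, 1});
    std::printf("path latency: %d cycles\n", latency);
    return 0;
}

Because every node and edge latency is explicit in this model, a scheduler built on top of it could launch a second path while the first is still streaming, which is the kind of compute-communication overlap the abstract describes.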
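
The second sketch, referenced from entry 2, shows the hierarchical loop structure an iterative radix-2 FFT presents to an HLS tool: a stage loop that must run sequentially, a group loop whose iterations are independent, and an inner butterfly loop that touches disjoint data. This is a plain Cooley-Tukey reference in C++, not HP-FFT's generated architecture; the comments on where parallelism lives, and the assumption that a hardware design would replace the twiddle recurrence with a ROM, are illustrative.

// Reference radix-2 Cooley-Tukey FFT, written to expose the loop hierarchy
// discussed in entry 2. Not HP-FFT's code; comments mark where an HLS
// generator could unroll or pipeline.
#include <cmath>
#include <complex>
#include <cstdio>
#include <utility>
#include <vector>

using cplx = std::complex<double>;

void fft_radix2(std::vector<cplx>& a) {
    const std::size_t n = a.size();        // assumed to be a power of two
    const double PI = std::acos(-1.0);

    // Reordering stage: bit-reversal permutation of the input.
    for (std::size_t i = 1, j = 0; i < n; ++i) {
        std::size_t bit = n >> 1;
        for (; j & bit; bit >>= 1) j ^= bit;
        j ^= bit;
        if (i < j) std::swap(a[i], a[j]);
    }

    // Stage loop: log2(n) iterations with a true dependence across stages,
    // so it runs sequentially.
    for (std::size_t len = 2; len <= n; len <<= 1) {
        const cplx w_len = std::polar(1.0, -2.0 * PI / double(len));
        // Group loop: butterfly groups are independent of each other, so a
        // generator can unroll or replicate across this loop for
        // coarse-grained parallelism.
        for (std::size_t start = 0; start < n; start += len) {
            cplx w(1.0, 0.0);
            // Butterfly loop: each butterfly touches disjoint elements; the
            // only carried dependence is the twiddle update w *= w_len, which
            // hardware designs typically replace with a twiddle ROM so the
            // loop can pipeline with an initiation interval of 1.
            for (std::size_t k = 0; k < len / 2; ++k) {
                const cplx u = a[start + k];
                const cplx v = a[start + k + len / 2] * w;
                a[start + k] = u + v;
                a[start + k + len / 2] = u - v;
                w *= w_len;
            }
        }
    }
}

int main() {
    // 8-point unit impulse: its FFT should be all ones.
    std::vector<cplx> x(8, cplx(0.0, 0.0));
    x[0] = cplx(1.0, 0.0);
    fft_radix2(x);
    for (const cplx& v : x)
        std::printf("(%.2f, %.2f)\n", v.real(), v.imag());
    return 0;
}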